De-identifying a public use microdata file from the Canadian national discharge abstract database

Authors

  • Khaled El Emam
  • David Paton
  • Fida Kamal Dankar
  • Günes Koru
Abstract

BACKGROUND The Canadian Institute for Health Information (CIHI) collects hospital discharge abstract data (DAD) from Canadian provinces and territories. There are many demands for the disclosure of this data for research and analysis to inform policy making. To expedite the disclosure of data for some of these purposes, the construction of a DAD public use microdata file (PUMF) was considered. Such purposes include: confirming some published results, providing broader feedback to CIHI to improve data quality, training students and fellows, providing an easily accessible data set for researchers to prepare for analyses on the full DAD data set, and serving as a large health data set for computer scientists and statisticians to evaluate analysis and data mining techniques. The objective of this study was to measure the probability of re-identification for records in a PUMF, and to de-identify a national DAD PUMF consisting of 10% of records.
METHODS Plausible attacks on a PUMF were evaluated. Based on these attacks, the 2008-2009 national DAD was de-identified. A new algorithm was developed to minimize the amount of suppression while maximizing the precision of the data. The acceptable threshold for the probability of correct re-identification of a record was set at between 0.04 and 0.05. Information loss was measured in terms of the extent of suppression and entropy.
RESULTS Two different PUMF files were produced, one with geographic information, and one with no geographic information but more clinical information. At a threshold of 0.05, the maximum proportion of records with the diagnosis code suppressed was 20%, but these suppressions represented only 8-9% of all values in the DAD. Our suppression algorithm has less information loss than a more traditional approach to suppression. Smaller regions, patients with longer stays, and age groups that are infrequently admitted to hospitals tend to be the ones with the highest rates of suppression.
CONCLUSIONS The strategies we used to maximize data utility and minimize information loss can result in a PUMF that would be useful for the specific purposes noted earlier. However, to create a more detailed file with less information loss suitable for more complex health services research, the risk would need to be mitigated by requiring the data recipient to commit to a data sharing agreement.
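
The abstract does not say how the probability of re-identification was computed, nor does it describe the new suppression algorithm. As a rough illustration of how a threshold such as 0.05 can be applied, the sketch below uses a common textbook measure: a record's risk is taken as 1 divided by the number of records sharing its quasi-identifier values, and records above the threshold are flagged. The function names, field names, and toy data are hypothetical, and this is not the authors' method.

```python
from collections import Counter

def reidentification_risk(records, quasi_identifiers):
    """Per-record risk as 1 / (size of the record's equivalence class).

    An equivalence class is the set of records that share the same values
    on all quasi-identifiers. This is a generic measure, assumed here for
    illustration only.
    """
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    class_sizes = Counter(keys)
    return [1.0 / class_sizes[k] for k in keys]

def flag_high_risk(records, quasi_identifiers, threshold=0.05):
    """Return one boolean per record: True if its risk exceeds the threshold."""
    risks = reidentification_risk(records, quasi_identifiers)
    return [risk > threshold for risk in risks]

# Toy data (hypothetical fields): 20 records in one class, 3 in another.
records = (
    [{"age_group": "60-69", "region": "A", "stay_days": 3}] * 20
    + [{"age_group": "10-19", "region": "B", "stay_days": 45}] * 3
)
flags = flag_high_risk(records, ["age_group", "region", "stay_days"])
```

With a threshold of 0.05, a class of 20 records sits exactly at the limit (risk 1/20) and is not flagged, while the class of 3 records (risk 1/3) is. In a real release, flagged records would then have some values suppressed or generalized; how to choose those values with minimal information loss is the part the paper's algorithm addresses and is not shown here.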

Similar resources

Quality Measurements of the Canadian Discharge Abstract Database

The Canadian Institute for Health Information (CIHI) is conducting the first national study of the quality of the Canadian hospital discharge data contained in the Discharge Abstract Database (DAD). The study is aimed at measuring discrepancies, identifying sources of error, and providing users of the study with statistically reliable information about the data quality of the DAD. The study wil...

Audit of Minimally Invasive Hysterectomy Rates: A Canadian Retrospective Cross-Sectional Database Review

Background: Minimally invasive hysterectomy is generally preferable to abdominal hysterectomy. The technicity index (TI) is the proportion of hysterectomies performed by minimally invasive surgery. Many centers globally have started to audit local TI as a quality indicator, but only a handful have published their results to help define international standards of care. ...

Sampling with Synthesis: A New Approach for Releasing Public Use Census Microdata

Many statistical agencies disseminate samples of census microdata, i.e., data on individual records, to the public. Before releasing the microdata, agencies typically alter identifying or sensitive values to protect data subjects’ confidentiality, for example by coarsening, perturbing, or swapping data. These standard disclosure limitation techniques distort relationships and distributional fea...

Quality of administrative health databases in Canada: A scoping review.

OBJECTIVE Administrative health databases are increasingly used to conduct population-based health research and surveillance; this has resulted in a corresponding growth in studies about their quality. Our objective was to describe the characteristics of published Canadian studies about administrative health database quality. METHODS PubMed, Scopus, and Google Advanced were searched, along wi...

Cross-National Longitudinal Business Database: A Synthetic Data Approach

In most countries, statistical agencies do not release establishment-level business microdata because doing so represents too large a risk to establishments’ confidentiality. One potential approach for overcoming these risks is to release synthetic data. The US Census Bureau Center for Economic Studies in collaboration with Duke University, the National Institute of Statistical Sciences, and C...

Journal title:

Volume 11  Issue 

Pages  -

Publication date 2011